{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Dataflowr](https://raw.githubusercontent.com/dataflowr/website/master/_assets/dataflowr_logo.png)](https://dataflowr.github.io/website/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "cKEoXRSzUUCv"
   },
   "source": [
    "# Collaborative filtering\n",
    "-----\n",
    "\n",
    "In this example, we'll build a quick explicit feedback recommender system: that is, a model that takes into account explicit feedback signals (like ratings) to recommend new content.\n",
    "\n",
    "We'll use an approach first made popular by the [Netflix prize](https://en.wikipedia.org/wiki/Netflix_Prize) contest: [matrix factorization](https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf). \n",
    "\n",
    "The basic idea is very simple:\n",
    "\n",
    "1. Start with user-item-rating triplets, conveying the information that user _i_ gave some item _j_ rating _r_.\n",
    "2. Represent both users and items as high-dimensional vectors of numbers. For example, a user could be represented by `[0.3, -1.2, 0.5]` and an item by `[1.0, -0.3, -0.6]`.\n",
    "3. The representations should be chosen so that, when we multiplied together (via [dot products](https://en.wikipedia.org/wiki/Dot_product)), we can recover the original ratings.\n",
    "4. The utility of the model then is derived from the fact that if we multiply the user vector of a user with the item vector of some item they _have not_ rated, we hope to obtain a prediction for the rating they would have given to it had they seen it.\n",
    "\n",
    "![collaborative filtering](matrix_factorization.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "OeHGO4qbUUC4"
   },
   "source": [
    "## 1. Preparations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "frj5rX9wUUC_"
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import os.path as op\n",
    "\n",
    "from zipfile import ZipFile\n",
    "try:\n",
    "    from urllib.request import urlretrieve\n",
    "except ImportError:  # Python 2 compat\n",
    "    from urllib import urlretrieve\n",
    "\n",
    "# this line needs to be modified if not on colab:\n",
    "data_folder = '/content/'\n",
    "\n",
    "ML_100K_URL = \"http://files.grouplens.org/datasets/movielens/ml-100k.zip\"\n",
    "ML_100K_FILENAME = op.join(data_folder,ML_100K_URL.rsplit('/', 1)[1])\n",
    "ML_100K_FOLDER = op.join(data_folder,'ml-100k')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "pK7DyF_ZUUDQ"
   },
   "source": [
    "We start with importing a famous dataset, the [Movielens 100k dataset](https://grouplens.org/datasets/movielens/100k/). It contains 100,000 ratings (between 1 and 5) given to 1682 movies by 943 users:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "tKfA7pagUUDT"
   },
   "outputs": [],
   "source": [
    "if not op.exists(ML_100K_FILENAME):\n",
    "    print('Downloading %s to %s...' % (ML_100K_URL, ML_100K_FILENAME))\n",
    "    urlretrieve(ML_100K_URL, ML_100K_FILENAME)\n",
    "\n",
    "if not op.exists(ML_100K_FOLDER):\n",
    "    print('Extracting %s to %s...' % (ML_100K_FILENAME, ML_100K_FOLDER))\n",
    "    ZipFile(ML_100K_FILENAME).extractall(data_folder)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "LKXSFZpUUUDl"
   },
   "source": [
    "Other datasets, see: [Movielens](https://grouplens.org/datasets/movielens/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "hHgLg4jBUUDp"
   },
   "source": [
    "## 2. Data analysis and formating"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ZlGyPSBpUUDt"
   },
   "source": [
    "[Python Data Analysis Library](http://pandas.pydata.org/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "QFRB1aAPUUDy"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "all_ratings = pd.read_csv(op.join(ML_100K_FOLDER, 'u.data'), sep='\\t',\n",
    "                          names=[\"user_id\", \"item_id\", \"ratings\", \"timestamp\"])\n",
    "all_ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "enyJuYhUUUEL"
   },
   "source": [
    "Let's check out a few macro-stats of our dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "oS_aeGxwj62g"
   },
   "outputs": [],
   "source": [
    "list_movies_names = []\n",
    "list_item_ids = []\n",
    "with open(op.join(ML_100K_FOLDER, 'u.item'), encoding = \"ISO-8859-1\") as fp:\n",
    "    for line in fp:\n",
    "        list_item_ids.append(line.split('|')[0])\n",
    "        list_movies_names.append(line.split('|')[1])\n",
    "        \n",
    "movies_names = pd.DataFrame(list(zip(list_item_ids, list_movies_names)), \n",
    "               columns =['item_id', 'item_name']) \n",
    "movies_names.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "NH-kq2rHj62j"
   },
   "outputs": [],
   "source": [
    "movies_names['item_id']=movies_names['item_id'].astype(int)\n",
    "all_ratings['item_id']=all_ratings['item_id'].astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "kkFnwTUTj62m"
   },
   "outputs": [],
   "source": [
    "all_ratings = all_ratings.merge(movies_names,on='item_id')\n",
    "all_ratings.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "3M7-jinQUUEO"
   },
   "outputs": [],
   "source": [
    "#number of entries\n",
    "len(all_ratings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ofVcE781UUEa"
   },
   "outputs": [],
   "source": [
    "all_ratings['ratings'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "tCdV2qzMUUEk"
   },
   "outputs": [],
   "source": [
    "# number of unique rating values\n",
    "len(all_ratings['ratings'].unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "zU9vARwJUUEs"
   },
   "outputs": [],
   "source": [
    "all_ratings['user_id'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "P1QoS71CUUE7"
   },
   "outputs": [],
   "source": [
    "# number of unique users\n",
    "total_user_id = len(all_ratings['user_id'].unique())\n",
    "print(total_user_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "l3gfokbNUUFH"
   },
   "outputs": [],
   "source": [
    "all_ratings['item_id'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "pE6-aj0KUUFY"
   },
   "outputs": [],
   "source": [
    "# number of unique rated items\n",
    "total_item_id = len(all_ratings['item_id'].unique())\n",
    "print(total_item_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "AiyZLRqVj627"
   },
   "outputs": [],
   "source": [
    "all_ratings['item_id'] = all_ratings['item_id'].apply(lambda x :x-1)\n",
    "all_ratings['user_id'] = all_ratings['user_id'].apply(lambda x :x-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "oiujn3oZj629"
   },
   "outputs": [],
   "source": [
    "movies_names['item_id']=movies_names['item_id'].apply(lambda x: x-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "FH6lv18Oj63A"
   },
   "outputs": [],
   "source": [
    "movies_names=movies_names.set_index('item_id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "1d_laQhNj63D"
   },
   "outputs": [],
   "source": [
    "movies_names.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Ur4bjuniUUFj"
   },
   "source": [
    "For spliting the data into _train_ and _test_ we'll be using a pre-defined function from [scikit-learn](http://scikit-learn.org/stable/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ZT5oxhoGUUFm"
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "ratings_train, ratings_test = train_test_split(\n",
    "    all_ratings, test_size=0.2, random_state=42)\n",
    "\n",
    "user_id_train = ratings_train['user_id']\n",
    "item_id_train = ratings_train['item_id']\n",
    "rating_train = ratings_train['ratings']\n",
    "\n",
    "user_id_test = ratings_test['user_id']\n",
    "item_id_test = ratings_test['item_id']\n",
    "rating_test = ratings_test['ratings']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "aLem0iAjUUFs"
   },
   "outputs": [],
   "source": [
    "len(user_id_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "45HhGRbFUUF2"
   },
   "outputs": [],
   "source": [
    "len(user_id_train.unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "z9m6gSoKUUF_"
   },
   "outputs": [],
   "source": [
    "len(item_id_train.unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Vmx1YZmTUUGG"
   },
   "source": [
    "We see that all the movies are not rated in the train set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "TYJk2PQmj63M"
   },
   "outputs": [],
   "source": [
    "movies_not_train = (set(all_ratings['item_id']) -set(item_id_train))\n",
    "for m in movies_not_train:\n",
    "    print(m,movies_names.loc[m]['item_name'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "TOY6YCqPUUGI"
   },
   "outputs": [],
   "source": [
    "user_id_train.iloc[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "6ZCYvrYhUUGP"
   },
   "outputs": [],
   "source": [
    "item_id_train.iloc[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "qZsssRYSUUGW"
   },
   "outputs": [],
   "source": [
    "rating_train.iloc[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Q7JF-d05UUGc"
   },
   "source": [
    "## 3. The model\n",
    "\n",
    "We can feed our dataset to the `FactorizationModel` class - a sklearn-like object that allows us to train and evaluate the explicit factorization models.\n",
    "\n",
    "Internally, the model uses the `Model_dot`(class to represents users and items. It's composed of a 4 `embedding` layers:\n",
    "\n",
    "- a `(num_users x latent_dim)` embedding layer to represent users,\n",
    "- a `(num_items x latent_dim)` embedding layer to represent items,\n",
    "- a `(num_users x 1)` embedding layer to represent user biases, and\n",
    "- a `(num_items x 1)` embedding layer to represent item biases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "tsrfFi1QUUGd"
   },
   "outputs": [],
   "source": [
    "import torch.nn as nn\n",
    "import torch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "iaMRfQoCj63W"
   },
   "outputs": [],
   "source": [
    "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Vaf7-kdnUUGh"
   },
   "source": [
    "Let's generate [Embeddings](http://pytorch.org/docs/master/nn.html#embedding) for the users, _i.e._ a fixed-sized vector describing the user"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "2iu-1X4AUUGi"
   },
   "outputs": [],
   "source": [
    "embedding_dim = 3\n",
    "embedding_user = nn.Embedding(total_user_id, embedding_dim)\n",
    "input = torch.LongTensor([[1,2,4,5],[4,3,2,0]])\n",
    "embedding_user(input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "HpkKu7PVUUGp"
   },
   "source": [
    "We will use some custom embeddings and dataloader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "eB6_y1nMUUGq"
   },
   "outputs": [],
   "source": [
    "class ScaledEmbedding(nn.Embedding):\n",
    "    \"\"\"\n",
    "    Embedding layer that initialises its values\n",
    "    to using a normal variable scaled by the inverse\n",
    "    of the embedding dimension.\n",
    "    \"\"\"\n",
    "    def reset_parameters(self):\n",
    "        \"\"\"\n",
    "        Initialize parameters.\n",
    "        \"\"\"\n",
    "        self.weight.data.normal_(0, 1.0 / self.embedding_dim)\n",
    "        if self.padding_idx is not None:\n",
    "            self.weight.data[self.padding_idx].fill_(0)\n",
    "\n",
    "\n",
    "class ZeroEmbedding(nn.Embedding):\n",
    "    \"\"\"\n",
    "    Used for biases.\n",
    "    \"\"\"\n",
    "    def reset_parameters(self):\n",
    "        \"\"\"\n",
    "        Initialize parameters.\n",
    "        \"\"\"\n",
    "        self.weight.data.zero_()\n",
    "        if self.padding_idx is not None:\n",
    "            self.weight.data[self.padding_idx].fill_(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ktXRW3-4UUGt"
   },
   "outputs": [],
   "source": [
    "class DotModel(nn.Module):\n",
    "    \n",
    "    def __init__(self,\n",
    "                 num_users,\n",
    "                 num_items,\n",
    "                 embedding_dim=32):\n",
    "        \n",
    "        super(DotModel, self).__init__()\n",
    "        \n",
    "        self.embedding_dim = embedding_dim\n",
    "        \n",
    "        self.user_embeddings = ScaledEmbedding(num_users, embedding_dim)\n",
    "        self.item_embeddings = ScaledEmbedding(num_items, embedding_dim)\n",
    "        self.user_biases = ZeroEmbedding(num_users, 1)\n",
    "        self.item_biases = ZeroEmbedding(num_items, 1)\n",
    "                \n",
    "        \n",
    "    def forward(self, user_ids, item_ids):\n",
    "        \n",
    "        #\n",
    "        # your code here\n",
    "        #\n",
    "        return \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "PlbbwU1ze5d7"
   },
   "outputs": [],
   "source": [
    "net = DotModel(total_user_id,total_item_id).to(device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ysX9Q9pxiMG4"
   },
   "source": [
    "Now test your network on a small batch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Cp2ZOCFhfERq"
   },
   "outputs": [],
   "source": [
    "batch_users_np = user_id_train.values[:5].astype(np.int32)\n",
    "batch_items_np = item_id_train.values[:5].astype(np.int32)\n",
    "batch_ratings_np = rating_train[:5].values.astype(np.float32)\n",
    "batch_users_tensor = torch.LongTensor(batch_users_np).to(device)\n",
    "batch_items_tensor = torch.LongTensor(batch_items_np).to(device)\n",
    "batch_ratings_tensor = torch.tensor(batch_ratings_np).to(device)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "8K4btQb2gsAg"
   },
   "outputs": [],
   "source": [
    "predictions = net(batch_users_tensor,batch_items_tensor)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "I20UhBLZjBEs"
   },
   "outputs": [],
   "source": [
    "predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ZGG3SBS9jizs"
   },
   "source": [
    "We will use MSE loss defined below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "7sqlTLCyiwzl"
   },
   "outputs": [],
   "source": [
    "def regression_loss(predicted_ratings, observed_ratings):\n",
    "    return ((observed_ratings - predicted_ratings) ** 2).mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "wK5LqsRXi0_G"
   },
   "outputs": [],
   "source": [
    "loss_fn = regression_loss\n",
    "loss = loss_fn(predictions, batch_ratings_tensor)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Ins38kThjNl0"
   },
   "outputs": [],
   "source": [
    "loss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "MPpfdFaikudZ"
   },
   "source": [
    "Check that your network is learning by overfitting your network on this small batch (you should reach a loss below 0.5 in the cell below)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "TcIY0t-xkFwT"
   },
   "outputs": [],
   "source": [
    "net = DotModel(total_user_id,total_item_id).to(device)\n",
    "optimizer = torch.optim.Adam(net.parameters(), lr = 0.1)\n",
    "for e in range(15):\n",
    "    #\n",
    "    # your code here\n",
    "    #"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "CYeBlERRUUG0"
   },
   "outputs": [],
   "source": [
    "def shuffle(*arrays):\n",
    "\n",
    "    random_state = np.random.RandomState()\n",
    "    shuffle_indices = np.arange(len(arrays[0]))\n",
    "    random_state.shuffle(shuffle_indices)\n",
    "\n",
    "    if len(arrays) == 1:\n",
    "        return arrays[0][shuffle_indices]\n",
    "    else:\n",
    "        return tuple(x[shuffle_indices] for x in arrays)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "OmlQTG2FUUG4"
   },
   "outputs": [],
   "source": [
    "def minibatch(batch_size, *tensors):\n",
    "\n",
    "    if len(tensors) == 1:\n",
    "        tensor = tensors[0]\n",
    "        for i in range(0, len(tensor), batch_size):\n",
    "            yield tensor[i:i + batch_size]\n",
    "    else:\n",
    "        for i in range(0, len(tensors[0]), batch_size):\n",
    "            yield tuple(x[i:i + batch_size] for x in tensors)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "2ekPPr7SUUG7"
   },
   "outputs": [],
   "source": [
    "import imp\n",
    "import numpy as np\n",
    "\n",
    "import torch.optim as optim\n",
    "\n",
    "class FactorizationModel(object):\n",
    "    \n",
    "    def __init__(self, embedding_dim=32, n_iter=10, batch_size=256, l2=0.0,\n",
    "                 learning_rate=1e-2, device=device, net=None, num_users=None,\n",
    "                 num_items=None,random_state=None):\n",
    "        \n",
    "        self._embedding_dim = embedding_dim\n",
    "        self._n_iter = n_iter\n",
    "        self._learning_rate = learning_rate\n",
    "        self._batch_size = batch_size\n",
    "        self._l2 = l2\n",
    "        self._device = device\n",
    "        self._num_users = num_users\n",
    "        self._num_items = num_items\n",
    "        self._net = net\n",
    "        self._optimizer = None\n",
    "        self._loss_func = None\n",
    "        self._random_state = random_state or np.random.RandomState()\n",
    "             \n",
    "        \n",
    "    def _initialize(self):\n",
    "        if self._net is None:\n",
    "            self._net = DotModel(self._num_users, self._num_items, self._embedding_dim).to(self._device)\n",
    "        \n",
    "        self._optimizer = optim.Adam(\n",
    "                self._net.parameters(),\n",
    "                lr=self._learning_rate,\n",
    "                weight_decay=self._l2\n",
    "            )\n",
    "        \n",
    "        self._loss_func = regression_loss\n",
    "    \n",
    "    @property\n",
    "    def _initialized(self):\n",
    "        return self._optimizer is not None\n",
    "    \n",
    "    \n",
    "    def fit(self, user_ids, item_ids, ratings, verbose=True):\n",
    "        \n",
    "        user_ids = user_ids.astype(np.int64)\n",
    "        item_ids = item_ids.astype(np.int64)\n",
    "        \n",
    "        if not self._initialized:\n",
    "            self._initialize()\n",
    "            \n",
    "        for epoch_num in range(self._n_iter):\n",
    "            users, items, ratingss = shuffle(user_ids,\n",
    "                                            item_ids,\n",
    "                                            ratings)\n",
    "\n",
    "            user_ids_tensor = torch.from_numpy(users).to(self._device)\n",
    "            item_ids_tensor = torch.from_numpy(items).to(self._device)\n",
    "            ratings_tensor = torch.from_numpy(ratingss).to(self._device)\n",
    "            epoch_loss = 0.0\n",
    "\n",
    "            for (minibatch_num,\n",
    "                 (batch_user,\n",
    "                  batch_item,\n",
    "                  batch_rating)) in enumerate(minibatch(self._batch_size,\n",
    "                                                         user_ids_tensor,\n",
    "                                                         item_ids_tensor,\n",
    "                                                         ratings_tensor)):\n",
    "                \n",
    "                \n",
    "                # beging to be completed\n",
    "                predictions = \n",
    "                #\n",
    "                loss = \n",
    "                epoch_loss = \n",
    "                #\n",
    "                #\n",
    "                # end to be completed\n",
    "            \n",
    "            epoch_loss = epoch_loss / (minibatch_num + 1)\n",
    "            \n",
    "            if verbose:\n",
    "                print('Epoch {}: loss_train {}'.format(epoch_num, epoch_loss))\n",
    "        \n",
    "            if np.isnan(epoch_loss) or epoch_loss == 0.0:\n",
    "                raise ValueError('Degenerate epoch loss: {}'\n",
    "                                 .format(epoch_loss))\n",
    "    \n",
    "    \n",
    "    def test(self,user_ids, item_ids, ratings):\n",
    "        self._net.train(False)\n",
    "        user_ids = user_ids.astype(np.int64)\n",
    "        item_ids = item_ids.astype(np.int64)\n",
    "        \n",
    "        user_ids_tensor = torch.from_numpy(user_ids).to(self._device)\n",
    "        item_ids_tensor = torch.from_numpy(item_ids).to(self._device)\n",
    "        ratings_tensor = torch.from_numpy(ratings).to(self._device)\n",
    "               \n",
    "        predictions = self._net(user_ids_tensor, item_ids_tensor)\n",
    "        \n",
    "        loss = self._loss_func(ratings_tensor, predictions)\n",
    "        return loss.data.item()\n",
    "\n",
    "    def predict(self,user_ids, item_ids):\n",
    "        self._net.train(False)\n",
    "        user_ids = user_ids.astype(np.int64)\n",
    "        item_ids = item_ids.astype(np.int64)\n",
    "        \n",
    "        user_ids_tensor = torch.from_numpy(user_ids).to(self._device)\n",
    "        item_ids_tensor = torch.from_numpy(item_ids).to(self._device)\n",
    "               \n",
    "        predictions = self._net(user_ids_tensor, item_ids_tensor)\n",
    "        return predictions.data   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "qGuzPUNrUUG_"
   },
   "outputs": [],
   "source": [
    "model = FactorizationModel(embedding_dim=50,  # latent dimensionality\n",
    "                                   n_iter=5,  # number of epochs of training\n",
    "                                   batch_size=1024,  # minibatch size\n",
    "                                   learning_rate=1e-3,\n",
    "                                   l2=1e-9,  # strength of L2 regularization\n",
    "                                   num_users=total_user_id,\n",
    "                                   num_items=total_item_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "mMKyTn4QUUHC"
   },
   "outputs": [],
   "source": [
    "user_ids_train_np = user_id_train.values.astype(np.int32)\n",
    "item_ids_train_np = item_id_train.values.astype(np.int32)\n",
    "ratings_train_np = rating_train.values.astype(np.float32)\n",
    "user_ids_test_np = user_id_test.values.astype(np.int64)\n",
    "item_ids_test_np = item_id_test.values.astype(np.int64)\n",
    "ratings_test_np = rating_test.values.astype(np.float32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "2Bd2CIosUUHJ"
   },
   "outputs": [],
   "source": [
    "model.fit(user_ids_train_np, item_ids_train_np, ratings_train_np)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "HKyc7Bg5UUHR"
   },
   "outputs": [],
   "source": [
    "model.test(user_ids_test_np, item_ids_test_np, ratings_test_np  )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "W9BIV9WaUUHN"
   },
   "outputs": [],
   "source": [
    "print(model._net)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Ltvw1o_dVjbm"
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.metrics import mean_absolute_error\n",
    "\n",
    "test_preds = model.predict(user_ids_test_np, item_ids_test_np)\n",
    "print(\"Final test RMSE: %0.3f\" % np.sqrt(mean_squared_error(test_preds.cpu(), ratings_test_np)))\n",
    "print(\"Final test MAE: %0.3f\" % mean_absolute_error(test_preds.cpu(), ratings_test_np))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "yXVIF75IUUHW"
   },
   "source": [
    "You can compare with [Surprise](https://github.com/NicolasHug/Surprise)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "MwQ3gXyuUUHY"
   },
   "source": [
    "## 4. Best and worst movies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "RxwaryZZUUHY"
   },
   "source": [
    "Getting the name of the movies (there must be a better way, please provide alternate solutions!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "I-hyp9IpUUHa"
   },
   "outputs": [],
   "source": [
    "list_movies_names = []\n",
    "list_item_ids = []\n",
    "with open(op.join(ML_100K_FOLDER, 'u.item'), encoding = \"ISO-8859-1\") as fp:\n",
    "    for line in fp:\n",
    "        list_item_ids.append(line.split('|')[0])\n",
    "        list_movies_names.append(line.split('|')[1])\n",
    "        \n",
    "movies_names = pd.DataFrame(list(zip(list_item_ids, list_movies_names)), \n",
    "               columns =['item_id', 'item_name']) \n",
    "movies_names.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "B8aAYBuhUUHf"
   },
   "outputs": [],
   "source": [
    "item_bias_np = model._net.item_biases.weight.data.cpu().numpy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ZhtanoCiUUHi"
   },
   "outputs": [],
   "source": [
    "movies_names['biases'] = pd.Series(item_bias_np.T[0], index=movies_names.index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "2TAx4Tz6UUHk"
   },
   "outputs": [],
   "source": [
    "movies_names.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "gn8gcmQjUUHo"
   },
   "outputs": [],
   "source": [
    "movies_names.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Q6ZmJYIdUUHt"
   },
   "outputs": [],
   "source": [
    "indices_item_train = np.sort(item_id_train.unique())\n",
    "movies_names = movies_names.loc[indices_item_train]\n",
    "movies_names.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "iUqELsLPUUH1"
   },
   "outputs": [],
   "source": [
    "movies_names = movies_names.sort_values(ascending=False,by=['biases'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "7TG247DXUUH4"
   },
   "source": [
    "Best movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ZzQGi5VIUUH5"
   },
   "outputs": [],
   "source": [
    "movies_names.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Y3fHTO-eUUH_"
   },
   "source": [
    "Worse movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "1-EhFYOwUUIA"
   },
   "outputs": [],
   "source": [
    "movies_names.tail(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "t_CPwQ2bj64P"
   },
   "source": [
    "## 5. PCA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "4WB_IZtPj64P"
   },
   "outputs": [],
   "source": [
    "item_emb_np = model._net.item_embeddings.weight.data.cpu().numpy()\n",
    "item_emb_np.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Uqe-QXDJj64R"
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "from operator import itemgetter\n",
    "\n",
    "pca = PCA(n_components=3)\n",
    "latent_fac = pca.fit_transform(item_emb_np)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "SArVAj3tj64U"
   },
   "outputs": [],
   "source": [
    "movie_comp = [(f, i) for f,i in zip(latent_fac[:,1], list_movies_names)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ZC9L-xPcj64V"
   },
   "outputs": [],
   "source": [
    "sorted(movie_comp, key=itemgetter(0), reverse=True)[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Ua6iU6TRj64W"
   },
   "outputs": [],
   "source": [
    "sorted(movie_comp, key=itemgetter(0), reverse=False)[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "uWyiIsPvj64X"
   },
   "outputs": [],
   "source": [
    "g = all_ratings.groupby('item_name')['ratings'].count()\n",
    "most_rated_movies = g.sort_values(ascending=False).index.values[:1000]\n",
    "most_rated_movies[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "zWAKDHvEj64Y"
   },
   "outputs": [],
   "source": [
    "idxs = range(50)\n",
    "txt_movies_names = most_rated_movies[:len(idxs)]\n",
    "X = latent_fac[idxs,0]\n",
    "Y = latent_fac[idxs,2]\n",
    "plt.figure(figsize=(15,15))\n",
    "plt.scatter(X, Y)\n",
    "for i, x, y in zip(txt_movies_names, X, Y):\n",
    "    plt.text(x+0.01,y-0.01,i, fontsize=11)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "XwOYSxcZUUII"
   },
   "source": [
    "## 6. SPOTLIGHT\n",
    "\n",
    "The code written above is a simplified version of [SPOTLIGHT](https://github.com/maciejkula/spotlight)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "-nPXmJ_nUUIJ"
   },
   "source": [
    "Once you installed it with: `conda install -c maciejkula -c pytorch spotlight=0.1.5`, you can compare the results..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "aCEqMg9WUUIK"
   },
   "source": [
    "[![Dataflowr](https://raw.githubusercontent.com/dataflowr/website/master/_assets/dataflowr_logo.png)](https://dataflowr.github.io/website/)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "include_colab_link": false,
   "name": "08_collaborative_filtering.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "dldiy",
   "language": "python",
   "name": "dldiy"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
